30 research outputs found

    SAT-LHUC: Speaker adaptive training for learning hidden unit contributions


    Learning Hidden Unit Contributions for Unsupervised Speaker Adaptation of Neural Network Acoustic Models


    Differentiable pooling for unsupervised speaker adaptation

    This paper proposes a differentiable pooling mechanism for model-based neural network speaker adaptation. The proposed technique learns a speaker-dependent combination of activations within pools of hidden units, was shown to work well in an unsupervised setting, and does not require speaker-adaptive training. We conducted a set of experiments on the TED talks data, as used in the IWSLT evaluations. Our results indicate that the approach can reduce word error rates (WERs) on standard IWSLT test sets by about 5–11% relative compared to speaker-independent systems, and that it is complementary to the recently proposed learning hidden unit contributions (LHUC) approach, reducing WER by 6–13% relative. Both methods were also found to work well when adapting with small amounts of unsupervised data: as little as 10 seconds reduces the WER by 5% relative compared to the baseline speaker-independent system.
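    The abstract above refers to LHUC-style adaptation, in which a small set of speaker-dependent parameters re-scales the hidden activations of a speaker-independent network. Below is a minimal numpy sketch of that idea; the 2·sigmoid re-parameterisation and all variable names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_layer(h, r_s):
    """Scale hidden activations h (batch x units) by speaker-dependent
    amplitudes; 2 * sigmoid(r_s) keeps each scale in (0, 2), and r_s = 0
    recovers the unadapted speaker-independent layer."""
    return h * (2.0 * sigmoid(r_s))

# Unsupervised adaptation would update only r_s (one scalar per hidden unit)
# from first-pass transcripts, leaving the speaker-independent weights fixed.
rng = np.random.default_rng(0)
h = np.maximum(rng.standard_normal((4, 8)), 0.0)   # e.g. ReLU activations
r_s = np.zeros(8)                                   # start from the SI model
print(lhuc_layer(h, r_s).shape)                     # (4, 8)
```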

    Neural networks for distant speech recognition

    Distant conversational speech recognition is challenging owing to the presence of multiple, overlapping talkers, additional non-speech acoustic sources, and the effects of reverberation. In this paper we review work on distant speech recognition, with an emphasis on approaches which combine multichannel signal processing with acoustic modelling, and investigate the use of hybrid neural network / hidden Markov model acoustic models for distant speech recognition of meetings recorded using microphone arrays. In particular we investigate the use of convolutional and fully-connected neural networks with different activation functions (sigmoid, rectified linear, and maxout). We performed experiments on the AMI and ICSI meeting corpora, with results indicating that neural network models are capable of significant improvements in accuracy compared with discriminatively trained Gaussian mixture models. Index Terms — convolutional neural networks, distant speech recognition, rectifier unit, maxout networks, beamforming, meetings, AMI corpus, ICSI corpus
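    The abstract compares sigmoid, rectified linear, and maxout hidden units. As a reminder of the less common of the three, here is a minimal numpy sketch of a maxout layer; the group size k and layer dimensions are illustrative assumptions.

```python
import numpy as np

def maxout(x, W, b, k):
    """Maxout activation: each output unit is the max over k affine pieces.
    W has shape (in_dim, out_dim * k) and b has shape (out_dim * k,)."""
    z = x @ W + b                          # (batch, out_dim * k)
    z = z.reshape(x.shape[0], -1, k)       # (batch, out_dim, k)
    return z.max(axis=-1)                  # (batch, out_dim)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 40))           # e.g. a window of filterbank features
W = rng.standard_normal((40, 256 * 3)) * 0.01
b = np.zeros(256 * 3)
print(maxout(x, W, b, k=3).shape)          # (4, 256)
```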

    Distant Speech Recognition Experiments Using the AMI Corpus


    Differentiable Pooling for Unsupervised Acoustic Model Adaptation

    We present a deep neural network (DNN) acoustic model that includes parametrised and differentiable pooling operators. Unsupervised acoustic model adaptation is cast as the problem of updating the decision boundaries implemented by each pooling operator. In particular, we experiment with two types of pooling parametrisations: learned L_p-norm pooling and weighted Gaussian pooling, in which the weights of both operators are treated as speaker-dependent. We perform investigations using three different large vocabulary speech recognition corpora: AMI meetings, TED talks and Switchboard conversational telephone speech. We demonstrate that differentiable pooling operators provide a robust and relatively low-dimensional way to adapt acoustic models, with relative word error rate reductions ranging from 5–20% with respect to unadapted systems, which themselves are better than the baseline fully-connected DNN-based acoustic models. We also investigate how the proposed techniques work under various adaptation conditions, including the quality of adaptation data and complementarity to other feature- and model-space adaptation methods, as well as providing an analysis of the characteristics of each of the proposed approaches. Comment: 11 pages, 7 tables, 7 figures, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, num. 11, 201
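    To make the "learned L_p-norm pooling" parametrisation concrete, here is a minimal numpy sketch of differentiable L_p pooling over non-overlapping pools of hidden units, where the per-pool order p is the speaker-dependent parameter to adapt. The normalisation, pool size, and variable names are assumptions for illustration and do not reproduce the paper's exact formulation.

```python
import numpy as np

def lp_pool(h, p, pool_size, eps=1e-8):
    """Differentiable L_p-norm pooling: ((1/N) * sum |h_i|^p)^(1/p) over each
    non-overlapping pool of N = pool_size hidden units. With p = 1 this is the
    mean absolute activation; as p grows it approaches max-pooling. Making the
    per-pool p speaker-dependent gives a low-dimensional adaptation parameter
    set."""
    batch, units = h.shape
    pools = h.reshape(batch, units // pool_size, pool_size)
    p_col = p[None, :, None]                           # broadcast over pools
    pooled = (np.abs(pools) ** p_col).mean(axis=-1) + eps
    return pooled ** (1.0 / p[None, :])

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 12))
p = np.full(12 // 3, 2.0)       # initialise every pool at L2; adapt p per speaker
print(lp_pool(h, p, pool_size=3).shape)   # (4, 4)
```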

    Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition

    This paper presents an extension to train end-to-end Context-Aware Transformer Transducer (CATT) models by using a simple, yet efficient method of mining hard negative phrases from the latent space of the context encoder. During training, given a reference query, we mine a number of similar phrases using approximate nearest neighbour search. These sampled phrases are then used as negative examples in the context list alongside random and ground truth contextual information. By including approximate nearest neighbour phrases (ANN-P) in the context list, we encourage the learned representation to disambiguate between similar, but not identical, biasing phrases. This improves biasing accuracy when there are several similar phrases in the biasing inventory. We carry out experiments in a large-scale data regime, obtaining up to 7% relative word error rate reductions for the contextual portion of test data. We also extend and evaluate the CATT approach in streaming applications. Comment: 5 pages, 2 figures, 2 tables
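    As a rough illustration of the hard-negative mining step described above, the sketch below scores candidate phrases against a reference query in the context-encoder embedding space and keeps the top-k as negatives. It uses brute-force cosine similarity where the paper uses an approximate nearest neighbour index, and all names and dimensions are hypothetical.

```python
import numpy as np

def mine_hard_negative_phrases(query_emb, phrase_embs, k):
    """Return indices of the k phrases whose context-encoder embeddings are
    most similar (cosine) to the reference query. In a large biasing inventory
    this exact search would be replaced by an approximate nearest neighbour
    index; the mined phrases are added to the context list as hard negatives
    alongside random negatives and the ground truth phrase."""
    q = query_emb / np.linalg.norm(query_emb)
    P = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
    sims = P @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
phrase_embs = rng.standard_normal((1000, 64))   # toy context-encoder outputs
query_emb = phrase_embs[42] + 0.05 * rng.standard_normal(64)
print(mine_hard_negative_phrases(query_emb, phrase_embs, k=5))
```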